Short subsequences in genomes: How random are they?

Authors

  • Yuriy Fofanov
  • Yi Luo
  • Charles Katili
  • Jim Wang
  • Yuri Y. Belosludtsev
  • Thomas F. Powdrill
  • Viacheslav Fofanov
  • Tong-Bin Li
  • Sergey Chumakov
  • B. Montgomery Pettitt
Abstract

A comparative statistical analysis of the presence of all possible short subsequences of length 5 to 20 nucleotides in the genomes of more than 250 microbial, viral and multicellular organisms was performed. A remarkable similarity of the presence/absence distributions for different n-mers in all genomes was found. The same analysis applied analytically and numerically to random sequences also shows a similar shape of the distribution, yielding the random boundary, with differences that correlate with biology. We hypothesize that the presence/absence distribution of n-mers in all genomes considered (provided that the condition M << 4^n holds, where M is the total genome sequence length) can be treated as nearly random. The relative deviation of the frequency of presence of n-mers from the purely random distribution can be used as a measure of “non-randomness” or self-similarity of a genome. Our results indicate that larger genomes are often less random than shorter ones. There is supplementary material. Accession number requested.

Introduction

Statistical analysis of the appearance of short subsequences in different DNA sequences, from individual genes to full genomes, is important for various reasons. Applications include PCR primer design (Fislage 1998; Fislage et al. 1997) and microarray probe design (Southern 2001). Several attempts (Deschavanne et al. 1999; Karlin and Ladunga 1994; Karlin and Mrazek 1997; Nakashima et al. 1997; Nakashima et al. 1998; Nussinov 1984; Sandberg et al. 2001) have been made to employ the frequency distribution of short subsequences (n-mers) to identify species with relatively short (microbial) genomes. In such an approach, the shapes of the frequency distributions for certain short subsequences, 2- to 4-mers (Deschavanne et al. 1999; Karlin and Ladunga 1994; Karlin and Mrazek 1997; Nakashima et al. 1997; Nakashima et al. 1998; Nussinov 1984) and 8- to 9-mers (Deschavanne et al. 1999; Sandberg et al. 2001), have been used to decide which microbial genome one is dealing with, based on a given piece of a genome or a whole genome.

Many sequencing projects are in progress and more full genomes have recently become available. The several hundred projects completed so far provide sufficient material to consider them from a statistical viewpoint. Yet we are still far from having a complete or even reasonable statistical picture. There are simply too many species and variations yet to be sequenced.

Here we present the results of a comparative statistical analysis of the presence/absence of all possible n-mers (n = 5-20) for all genomes available (before May 2002) in the NCBI [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome], including microbial (76 genomes), viral (176 genomes), and five genomes of multicellular organisms. Let us stress that we do not consider the number of appearances of n-mers in a genome (frequency of appearance), but only whether a given n-mer is present or absent (frequency of presence) in a given genome.

It is well known that when the genome size M > 4^n, the appearances of n-mers in various genomes are not random (Karlin and Ladunga 1994; Karlin and Mrazek 1997; Nakashima et al. 1997; Nakashima et al. 1998; Nussinov 1984). The basic motivation of our analysis is to explore the statistical properties of the presence of longer n-mers when the condition M << 4^n holds. There are several reasons to expect that the distributions of presence of longer n-mers are also not random. First, genomes (especially large ones) contain structural repeats. Second, since the occurrence statistics for short oligonucleotides (2- and 3-mers) are not random, this affects the occurrence distributions for longer n-mers, since they contain 2- and 3-mers as structural elements.
However, our analysis of more than 250 genomes of microbial, viral and multicellular organisms shows that the distributions of presence in the range M << 4^n remain nearly random or at least contain a strong random component.

Results

Microbial and viral genomes. We have calculated the number of all distinct 7- to 15-mers present in each of the viral and microbial genomes. Tables 1 and 2 contain representative results for some of the analyzed genomes (microbial and viral), for n = 8 and 12. Complete tables including all of the 252 genomes can be found on a supplementary data website (http://www.bioinfo.uh.edu/publications/how_random_are_genomes/).

It is worth mentioning that as n increases, the total number of possible n-mers, 4^n, strongly exceeds the total sequence length M, and most of the possible n-mers do not appear at all, because the maximum number of n-mers contained in the sequence is M-n+1 ≈ M. Moreover, for a reasonably high ratio 4^n/M, most of the n-mers that appear tend to appear only once, in accordance with the fact that the number of present n-mers becomes very close to M (see Tables 1, 2 and supplementary data). That is why we have chosen to use the statistics for “present/absent” (frequency of presence) in our analysis instead of the usual “frequency of appearance”, which is reasonable for short n-mers (total sequence length M > 4^n). We give precise definitions of these quantities in the Appendix.

We now consider the results obtained for different n-mers in the various genomes. We plot the frequency of presence, f, of n-mers in genomes (the number of different n-mers present in a given genome divided by the total number of possible n-mers, 4^n) against the ratio 4^n/M. Figures 1-3 correspond to microbial genomes, RNA-containing viruses and DNA-containing viruses, respectively. The analytical distribution that corresponds to the frequency of presence of n-mers in a purely random “genome” (see Appendix) is also shown for comparison in all figures.
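The frequency of presence described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, only the single strand given is scanned, and windows containing characters other than A, C, G, T are skipped, all of which are assumptions made for illustration.

```python
from typing import Set, Tuple


def distinct_nmers(sequence: str, n: int) -> Set[str]:
    """Collect the distinct n-mers over A/C/G/T present in `sequence`.

    Assumption: only the given strand is scanned, and any window that
    contains a non-ACGT symbol (for example N) is ignored.
    """
    seq = sequence.upper()
    alphabet = set("ACGT")
    present = set()
    # A sequence of length M contains at most M - n + 1 windows of length n.
    for i in range(len(seq) - n + 1):
        window = seq[i:i + n]
        if set(window) <= alphabet:
            present.add(window)
    return present


def frequency_of_presence(sequence: str, n: int) -> Tuple[float, float]:
    """Return (f, ratio) with f = (#distinct n-mers) / 4^n and ratio = 4^n / M."""
    m = len(sequence)
    total = 4 ** n
    f = len(distinct_nmers(sequence, n)) / total
    return f, total / m


# Toy example on a 14-nucleotide string (hypothetical data, not a genome).
toy = "ACGTACGTTTGACA"
f, ratio = frequency_of_presence(toy, 2)
print(f"f = {f:.3f}, 4^n/M = {ratio:.3f}")
```

Storing n-mers as Python strings in a set is convenient for small inputs; for whole genomes and n near 20, a bit array indexed by a 2-bit encoding of each n-mer would likely be needed, which this sketch does not attempt.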
Note the extraordinary similarity between these plots. All of the different genomes form a well-defined pattern when plotted against the ratio 4^n/M, rather than against the size of the genome or the length of the n-mer separately.

Multicellular organisms. For the much longer genomes of multicellular organisms, practically all n-mers with n < 12 are present. Therefore, we have calculated the number of distinct 13- to 20-mers present in each genome. The results are shown in Figure 4 and Table 3. In addition, we performed the same calculation for each human chromosome separately (see Figure 5 and Table 4). A well-pronounced pattern can be observed in all these figures. It is noteworthy that multicellular organisms, especially rice and human, demonstrate a much higher systematic deviation from the random boundary.

Discussion

The dependences in Figures 1-5 have a very similar rough shape. This remarkable similarity leads us to the hypothesis that the frequency f of presence/absence of relatively long n-mers (M < 4^n) can be treated as the result of a random process, or at least may contain a strong random component. This assumption motivated us to perform the following Monte Carlo simulation and analytical analysis.

We generated 100,000 random sequences of varying length M (from M = 1 Kb to M = 10 Mb) and applied to them the same analysis as for real genomes. We considered two cases. First, we used equal probabilities p_i of appearance of every nucleotide (p_A = p_C = p_T = p_G = 0.25) to generate random sequences. Second, to make our random sequences closer to real genomes, we calculated the probabilities for each nucleotide in the three groups of genomes mentioned above (microbial, DNA viruses and RNA viruses; see supplementary data) and used them for our simulations as well. It turns out that the difference between these two simulations is negligibly small. This is, in fact, natural for actual probabilities that are close to 0.25; namely, for all cases, 0.22 < p_i < 0.29.
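The two simulation cases just described can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function names, the sequence length, the value of n, the seed and the skewed probabilities are all hypothetical (the latter chosen only to fall inside 0.22 < p_i < 0.29). The reference value 1 - exp(-(M-n+1)/4^n) is a standard approximation for independent, equiprobable nucleotides and is not claimed to be the formula derived in the paper's Appendix.

```python
import math
import random


def random_sequence(m: int, probs, rng: random.Random) -> str:
    """Draw m independent nucleotides with probabilities `probs` for A, C, G, T."""
    return "".join(rng.choices("ACGT", weights=probs, k=m))


def presence_fraction(seq: str, n: int) -> float:
    """Fraction of the 4^n possible n-mers that occur at least once in seq."""
    distinct = {seq[i:i + n] for i in range(len(seq) - n + 1)}
    return len(distinct) / 4 ** n


def approx_random_presence(m: int, n: int) -> float:
    """Assumed approximation: 1 - exp(-(M - n + 1) / 4^n) for uniform nucleotides."""
    return 1.0 - math.exp(-(m - n + 1) / 4 ** n)


rng = random.Random(0)          # fixed seed so the run is repeatable
m, n = 20_000, 7                # illustrative sizes, far smaller than the paper's

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.29, 0.22, 0.22, 0.27]   # hypothetical values within 0.22 < p_i < 0.29

f_uniform = presence_fraction(random_sequence(m, uniform, rng), n)
f_skewed = presence_fraction(random_sequence(m, skewed, rng), n)
approx = approx_random_presence(m, n)

print(f"uniform: {f_uniform:.3f}  skewed: {f_skewed:.3f}  approximation: {approx:.3f}")
```

For these parameters the approximation evaluates to roughly 0.70. A single sequence per case is used here for brevity; averaging over many sequences and a range of M, as in the text, would give a smoother curve.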
The results of the simulations fit the real data remarkably well. In fact, the frequencies of presence of n-mers, f, in the various genomes lie close to the same universal curve representing the random boundary (always being below it). The analytical derivation of this curve can be found in the Appendix. Assuming equal probabilities of appearance of every nucleotide, we have (in full agreement with the Monte

Publication date: 2004